Question and Data

During the COVID pandemic, the Centers for Disease Control and Prevention (CDC) published an article showing that people with obesity were more likely to die or to suffer further complications from the virus. Many other studies similarly point to the severity of obesity as a health problem.

The United States is known for being a land of great opportunity, wealth and freedom. Along with that, however, it is associated with a sedentary lifestyle, fast food restaurants, and obesity. This stereotype doesn’t come from nowhere; the United States has an obesity rate of 36.5%, making the country the most obese among developed countries. Within the limits of what data science can conclude, we set out to answer: Is the availability of fast food restaurants in the USA associated with higher obesity rates? We also attempt to identify other variables that could be associated with obesity rates and to gauge how strongly each one is related to them.

Data

We used data from the Food Environment Atlas, published by the U.S. Department of Agriculture (USDA), collected between 2013 and 2015. The Atlas draws on the Behavioral Risk Factor Surveillance System (BRFSS), the U.S. Census, and the USDA’s Economic Research Service, and it organizes its data at the county level, so we treat each county as one data point. Because the variables were measured in different years, our results may be slightly affected, but we expect the effect to be small since the measurements are at most two years apart.

We picked a few variables that we thought could be relevant in determining obesity rates:

obesrate = rate of obesity in each county in the US, 2013
fasfoo = fast-food restaurants per 1000 people in each county in the US, 2014
medinc = median household income, 2015
diab = rate of diabetes in each county in the US, 2013
fitplace = recreation & fitness facilities per 1000 people in each county in the US, 2014
loaccgro = percent of access to grocery stores in each county in the US, 2015.

https://www.ers.usda.gov/data-products/food-environment-atlas/data-access-and-documentation-downloads/

Methods

To answer our question, we used a two-step process. In the first step, we split the initial data set into clusters using k-means, with the aim of letting the models in the second step make better predictions on more homogeneous groups of counties. We also expected that fitting a separate model per cluster would help limit over-fitting. Clustering may also offer additional insights into our data; for example, we can use visualizations to identify patterns by location.

We clustered on the variables mentioned above (fasfoo, medinc, diab, fitplace, loaccgro), excluding obesity rate as this is our target variable.

Based on the elbow chart, 4 centers looked best for k-means. We created one data set per cluster (4 in total), leading us to the second step of our two-step process. For each data set, we trained a random forest (RF) regression model and evaluated it using mean squared error (MSE). RF also lets us inspect variable importance, which helps us answer which features are the best predictors of obesity rate. RF’s use of bagging (averaging many trees, each fit to a bootstrap sample of the data with a random subset of features considered at each split) to mitigate over-fitting also made it an appealing option.
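To illustrate how an elbow chart is produced, here is a minimal sketch of k-means (Lloyd's algorithm) and the within-cluster sum of squares for several values of k. It is written in plain Python rather than the R used for the analysis, and the toy points, the seed, and all function names are assumptions made for illustration only.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    # Label each point with the index of its nearest center.
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

def within_cluster_ss(points, centers, labels):
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical toy data: three well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (0.0, 1.0), (0.1, 1.0), (0.0, 1.1)]

elbow = {}
for k in range(1, 6):
    centers, labels = kmeans(points, k)
    elbow[k] = within_cluster_ss(points, centers, labels)

for k, wss in elbow.items():
    print(k, round(wss, 4))
```

Plotting the within-cluster sum of squares against k and looking for the point where the curve flattens gives the "elbow"; on the county data this kind of plot is what suggested 4 centers.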

Exploratory Data Analysis

summary(county_data)
##     state              county             loaccgro          fasfoo       
##  Length:3120        Length:3120        Min.   :0.0000   Min.   :0.00000  
##  Class :character   Class :character   1st Qu.:0.1094   1st Qu.:0.03520  
##  Mode  :character   Mode  :character   Median :0.1920   Median :0.04875  
##                                        Mean   :0.2307   Mean   :0.05627  
##                                        3rd Qu.:0.2886   3rd Qu.:0.06490  
##                                        Max.   :1.0000   Max.   :1.00000  
##       diab           obesrate         fitplace           medinc      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.3069   1st Qu.:0.4609   1st Qu.:0.00000   1st Qu.:0.1704  
##  Median :0.3861   Median :0.5419   Median :0.07431   Median :0.2321  
##  Mean   :0.3926   Mean   :0.5367   Mean   :0.08391   Mean   :0.2494  
##  3rd Qu.:0.4752   3rd Qu.:0.6145   3rd Qu.:0.12934   3rd Qu.:0.3037  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000
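Every numeric column in the summary runs from exactly 0 to 1, which is consistent with min-max rescaling of the raw values. The report does not state how the rescaling was done, so the following Python sketch is only an assumption about that step; the income figures are made up.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical median household incomes for four counties (not from the data set).
median_income = [32000.0, 45000.0, 58000.0, 125000.0]
scaled = min_max_scale(median_income)
print(scaled)
```

Putting all variables on the same 0 to 1 scale matters for k-means, since otherwise a variable measured in dollars would dominate distances over variables measured as rates.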

Since we use a two step data analysis process, first clustering the data and then running separate models in each cluster, we thought this would be a good opportunity to explore these clusters a little more.

We clustered our parameters in 5 dimensions: one for each of the explanatory variables. Because of that, showing the clusters themselves would require several plots with little to take away from them. We believed that a bar plot with the average of each parameter in each cluster would be more relevant and informative, and simpler to read.
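The per-cluster averages behind such a bar plot can be computed as in the following Python sketch. The rows and the `cluster` key are hypothetical; only the feature names come from the variable list above.

```python
from collections import defaultdict

FEATURES = ["loaccgro", "fasfoo", "diab", "fitplace", "medinc"]

def cluster_feature_means(rows):
    """rows: list of dicts with a 'cluster' key plus one key per feature."""
    sums = defaultdict(lambda: {f: 0.0 for f in FEATURES})
    counts = defaultdict(int)
    for row in rows:
        counts[row["cluster"]] += 1
        for f in FEATURES:
            sums[row["cluster"]][f] += row[f]
    return {c: {f: sums[c][f] / counts[c] for f in FEATURES} for c in counts}

# Hypothetical scaled county rows, two per cluster.
rows = [
    {"cluster": 1, "loaccgro": 0.25, "fasfoo": 0.25, "diab": 0.25, "fitplace": 0.5, "medinc": 0.75},
    {"cluster": 1, "loaccgro": 0.75, "fasfoo": 0.25, "diab": 0.25, "fitplace": 0.5, "medinc": 0.25},
    {"cluster": 3, "loaccgro": 0.0, "fasfoo": 0.5, "diab": 1.0, "fitplace": 0.0, "medinc": 0.0},
    {"cluster": 3, "loaccgro": 0.5, "fasfoo": 0.5, "diab": 0.5, "fitplace": 0.0, "medinc": 0.5},
]
means = cluster_feature_means(rows)
for cluster, feature_means in sorted(means.items()):
    print(cluster, feature_means)
```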

Cluster 1:
Groups the wealthier counties, which have lower diabetes rates and the highest access to fitness establishments.
Cluster 2:
Groups mainly counties with high access to groceries and the lowest access to fitness establishments.
Cluster 3:
Groups counties with the lowest incomes, the highest rates of diabetes and the lowest rates of access to groceries.
Cluster 4:
Appears to be the most balanced of all the clusters.

Next, we thought that it could be good to see how these clusters are displayed geographically in the country. We calculated the percentage of counties per state that are in each cluster and plotted that information.
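The percentage of counties per state that fall in each cluster can be computed as in the Python sketch below. The state and cluster assignments here are invented for illustration and do not come from the data set.

```python
from collections import Counter, defaultdict

def cluster_share_by_state(assignments):
    """assignments: list of (state, cluster) pairs, one per county.

    Returns {state: {cluster: percent of that state's counties}}.
    """
    per_state = defaultdict(Counter)
    for state, cluster in assignments:
        per_state[state][cluster] += 1
    return {
        state: {c: 100.0 * n / sum(counts.values()) for c, n in counts.items()}
        for state, counts in per_state.items()
    }

# Hypothetical county assignments.
assignments = [
    ("CA", 1), ("CA", 1), ("CA", 4), ("CA", 4),
    ("MS", 3), ("MS", 3), ("MS", 3), ("MS", 4),
]
shares = cluster_share_by_state(assignments)
print(shares)
```

Using percentages rather than raw counts keeps states with many counties from dominating the map.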

Cluster 1:

Mostly New England and California. This is the wealthier cluster, so it definitely makes sense.

Cluster 2:

There are extremely few data points in this cluster. It seems that most of the places where a high percentage of the population has access to groceries are in the southwest and the northwest of the country.

Cluster 3:

Concentrated in the southern states, aside from Florida and Texas.

Cluster 4:

Seems to be almost evenly distributed across the country, which makes sense given the bar plot.

Evaluation of our Models

Cluster 1 Evaluation

##               %IncMSE IncNodePurity
## loaccgro 0.0002616216     0.1420106
## fasfoo   0.0019643544     0.2606380
## diab     0.0055175303     0.4533984
## fitplace 0.0006342932     0.1598106
## medinc   0.0012446579     0.2191012

Diabetes was the most important in predicting the obesity rate, which makes sense since these would appear to be highly correlated variables at first glance.
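For context, the %IncMSE column is generally a permutation-based measure: it reflects how much the prediction error grows when one feature's values are shuffled. The Python sketch below shows that idea with a hypothetical stand-in predictor instead of a random forest, and it omits the out-of-bag averaging across trees that a forest implementation would use.

```python
import random

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def permutation_importance(predict, rows, targets, feature, seed=0):
    """Increase in MSE after shuffling the values of one feature across rows."""
    baseline = mse(targets, [predict(r) for r in rows])
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
    return mse(targets, [predict(r) for r in permuted_rows]) - baseline

# Hypothetical stand-in model: depends on diab and ignores fitplace.
def toy_predict(row):
    return 0.2 + 0.8 * row["diab"]

# Hypothetical rows; targets are generated by the toy model itself.
rows = [{"diab": d / 10, "fitplace": (7 * d % 10) / 10} for d in range(10)]
targets = [toy_predict(r) for r in rows]

imp_diab = permutation_importance(toy_predict, rows, targets, "diab")
imp_fitplace = permutation_importance(toy_predict, rows, targets, "fitplace")
print(imp_diab, imp_fitplace)
```

A feature the model relies on produces a clear increase in error when shuffled, while a feature the model ignores produces none.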

Random Forest MSE:

## [1] 0.006136722

MSE using obesity rate’s mean as prediction:

## [1] 0.01350612
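The two numbers above compare the model against a naive baseline that always predicts the mean obesity rate. A minimal Python sketch of that comparison follows; the target values and model predictions are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length lists."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_baseline_mse(train_targets, test_targets):
    """MSE on the test set when every prediction is the training-set mean."""
    mean = sum(train_targets) / len(train_targets)
    return mse(test_targets, [mean] * len(test_targets))

# Hypothetical scaled obesity rates.
train_targets = [0.4, 0.5, 0.6]
test_targets = [0.5, 0.7]
model_predictions = [0.52, 0.66]  # hypothetical model output for the test rows

baseline = mean_baseline_mse(train_targets, test_targets)
model = mse(test_targets, model_predictions)
print(baseline, model)
```

A model is only useful to the extent that its MSE falls below this baseline.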

Cluster 2 Evaluation

Cluster 2 has very few data points. Because of this, we used 85% of the data for training and tuning and held out the rest for testing.
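An 85/15 split of this kind can be sketched in Python as follows. The seed, the function name, and the 40 stand-in records are assumptions; the report does not say how many counties are in the cluster or how the split was drawn.

```python
import random

def train_test_split(rows, train_fraction=0.85, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(40))  # stand-in for 40 county records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when the test set is small enough that a different draw could change the reported MSE noticeably.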

This was a very interesting cluster, as it concentrates counties where a very high share of the population has access to groceries together with counties with low access to gyms. Initially we did not expect these features to be negatively correlated, but it seems they are, at least at the national level.

##           %IncMSE IncNodePurity
## loaccgro 0.000569    0.14883766
## fasfoo   0.001219    0.21474863
## diab     0.002749    0.37094615
## fitplace 0.000086    0.01696633
## medinc   0.001287    0.25967360

Random Forest MSE:

## [1] 0.01233427

MSE using obesity rate’s mean as prediction:

## [1] 0.01798417

The mean squared error of the sample was low. This is to be expected, since we clustered the data into four groups of counties that are similar in the factors related to obesity rate.

If, instead of training a model, we simply predicted the average obesity rate of the training data, the MSE would be 0.01798417. The random forest gives an MSE of 0.01233427, which is about 31% lower.

Given that there are so few data points, especially in the test portion, we believe this is a good result.

Unsurprisingly, the rate of diabetes in the county was the most important predictor. Next, median income and the number of fast-food restaurants essentially tied as the second most relevant measures, and access to grocery stores came in fourth. Access to gyms had almost no relevance.

Cluster 3 Evaluation

##               %IncMSE IncNodePurity
## loaccgro 0.0004148564    0.14368152
## fasfoo   0.0008994852    0.17054091
## diab     0.0041996768    0.32150511
## fitplace 0.0001376759    0.06746348
## medinc   0.0007539823    0.16368026

Random Forest MSE:

## [1] 0.007061758

MSE using obesity rate’s mean as prediction:

## [1] 0.01176137

The random forest model performed better than just guessing the mean obesity rate in both the test and train set once again.

Cluster 4 Evaluation

##               %IncMSE IncNodePurity
## loaccgro 1.104834e-04    0.13156826
## fasfoo   1.219760e-03    0.22510925
## diab     5.012530e-03    0.38400445
## fitplace 3.298028e-05    0.08183718
## medinc   7.854047e-04    0.22025044

Random Forest MSE:

## [1] 0.008394561

MSE using obesity rate’s mean as prediction:

## [1] 0.01205747
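To put the four clusters side by side, the relative reduction in MSE over the mean baseline can be computed from the values reported above. The Python sketch below uses only those reported numbers; the function name is illustrative.

```python
# (random forest MSE, mean-baseline MSE) as reported for each cluster.
reported = {
    1: (0.006136722, 0.01350612),
    2: (0.01233427, 0.01798417),
    3: (0.007061758, 0.01176137),
    4: (0.008394561, 0.01205747),
}

def relative_reduction(model_mse, baseline_mse):
    """Fraction by which the model's MSE is lower than the baseline's."""
    return 1.0 - model_mse / baseline_mse

reductions = {c: relative_reduction(rf, base) for c, (rf, base) in reported.items()}
for cluster, r in reductions.items():
    print(f"cluster {cluster}: {100 * r:.1f}% lower than the mean baseline")
```

In every cluster the random forest improves on the baseline, with the largest relative gain in cluster 1.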

Fairness Assessment

On the fairness side of things, there isn’t too much to look into. When a model’s fairness is to be evaluated, the most important thing to examine is how it treats protected classes. These protected classes could include race, gender, and so on. These could also include proxies for race, gender, etc. such as family statistics, education, income, and so on. However, our model only includes median income as one feature, and that itself wasn’t rated very highly in the importance metrics for three out of our four clusters. Therefore, fairness should not be too much of an issue in our model.

Conclusions

Based on each of our random forest models, the rate of diabetes in a county was clearly the best predictor of obesity rate, as expected. After diabetes, the number of fast-food restaurants per 1000 people was the next best predictor, ranking 2nd in variable importance (%IncMSE) for every cluster except the second, where it essentially tied with median household income for 2nd place.

Meanwhile, for every cluster except cluster 1, the number of recreation/fitness centers per 1000 people was the least useful in predicting obesity rate, while the percent of access to grocery stores ranked last in cluster 1.
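These rankings can be checked directly against the %IncMSE values reported in the evaluation tables. The Python sketch below sorts the features of each cluster by that column; the values are copied from the tables and the helper name is illustrative.

```python
# %IncMSE values as reported in the four evaluation tables.
inc_mse = {
    1: {"loaccgro": 0.0002616216, "fasfoo": 0.0019643544, "diab": 0.0055175303,
        "fitplace": 0.0006342932, "medinc": 0.0012446579},
    2: {"loaccgro": 0.000569, "fasfoo": 0.001219, "diab": 0.002749,
        "fitplace": 0.000086, "medinc": 0.001287},
    3: {"loaccgro": 0.0004148564, "fasfoo": 0.0008994852, "diab": 0.0041996768,
        "fitplace": 0.0001376759, "medinc": 0.0007539823},
    4: {"loaccgro": 1.104834e-04, "fasfoo": 1.219760e-03, "diab": 5.012530e-03,
        "fitplace": 3.298028e-05, "medinc": 7.854047e-04},
}

def rank_features(scores):
    """Feature names ordered from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)

rankings = {cluster: rank_features(scores) for cluster, scores in inc_mse.items()}
for cluster, ranking in rankings.items():
    print(cluster, ranking)
```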

In conclusion, given the correlation shown by our models, it looks like factors that may contribute to someone becoming obese (e.g. fast-food restaurants) are likely to be better indicators than factors that would help prevent someone from becoming obese (e.g. fitness centers).

Future Work

Looking into additional factors would have been beneficial, considering we only had 5 features. Our features were also fairly obvious predictors of obesity, so the outcomes weren’t too surprising. Variables such as population demographics (e.g. age, ethnicity), population density, screen time (e.g. iPhone screen time), number of registered vehicles per household, and public transportation budget could be interesting to include.

Furthermore, we could split our initial data set in different ways. For instance, we could use more or fewer clusters, or we could intentionally group certain counties together into specific subgroups, such as by state, region, or time zone.